 response selection


Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems

arXiv.org Artificial Intelligence

Recently, knowledge-grounded conversations in the open domain have gained great attention from researchers. Existing works on retrieval-based dialogue systems have devoted tremendous effort to building neural matching models, in which all of the context and knowledge contents are used to match the response candidate with various representation methods. In fact, different parts of the context and knowledge are differentially important for recognizing the proper response candidate, as many utterances are useless due to topic shift. This excess of useless information in the context and knowledge can influence the matching process and lead to inferior performance. To address this problem, we propose a multi-turn Response Selection Model that can Detect the relevant parts of the Context and Knowledge collection (RSM-DCK). Our model first uses the recent context as a query to pre-select relevant parts of the context and knowledge collection at the word-level and utterance-level semantics. Further, the response candidate interacts with the selected context and knowledge collection respectively. In the end, the fused representation of the context and response candidate is utilized to post-select the relevant parts of the knowledge collection more confidently for matching. We test our proposed model on two benchmark datasets. Evaluation results indicate that our model achieves better performance than the existing methods, and can effectively detect the relevant context and knowledge for response selection.
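
As a rough illustration of the pre-selection idea in the abstract above (using the recent context as a query to keep only the relevant earlier utterances), the following sketch scores each utterance against the query and drops those below a threshold. It is not the RSM-DCK model: the bag-of-words vectors, the cosine scoring, the function names `bow`, `cosine`, and `preselect`, and the `threshold` value are all assumptions standing in for the learned word-level and utterance-level representations the paper describes.

```python
import math
from collections import Counter


def bow(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def preselect(utterances, query, threshold=0.2):
    """Keep utterances whose similarity to the query reaches the threshold.

    `query` plays the role of the recent context; `threshold` is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    query_vec = bow(query)
    return [u for u in utterances if cosine(bow(u), query_vec) >= threshold]


context = [
    "i love hiking in the mountains",
    "my cat is sick today",
    "which mountains do you hike",
]
selected = preselect(context, "do you hike in the mountains often")
print(selected)
```

In this toy run the off-topic utterance about the cat shares no words with the query and is filtered out, which mirrors the topic-shift motivation in the abstract. The same function could be applied to knowledge sentences instead of context utterances.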


Whom to Respond To? A Transformer-Based Model for Multi-Party Social Robot Interaction

arXiv.org Artificial Intelligence

Prior human-robot interaction (HRI) research has primarily focused on single-user interactions, where robots do not need to consider the timing or recipient of their responses. However, in multi-party interactions, such as at malls and hospitals, social robots must understand the context and decide both when and to whom they should respond. In this paper, we propose a Transformer-based multi-task learning framework to improve the decision-making process of social robots, particularly in multi-user environments. Considering the characteristics of HRI, we propose two novel loss functions: one that enforces constraints on active speakers to improve scene modeling, and another that guides response selection towards utterances specifically directed at the robot. Additionally, we construct a novel multi-party HRI dataset that captures real-world complexities, such as gaze misalignment. Experimental results demonstrate that our model achieves state-of-the-art performance in respond decisions, outperforming existing heuristic-based and single-task approaches. Our findings contribute to the development of socially intelligent social robots capable of engaging in natural and context-aware multi-party interactions.


On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval

arXiv.org Artificial Intelligence

Multimodal chatbots have become one of the major topics for dialogue systems in both the research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves comparable performance without the intermediate step of the two-step approach. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance by transferring knowledge across the subtasks and the modalities.


Multi-Party Conversational Agents: A Survey

arXiv.org Artificial Intelligence

Multi-party Conversational Agents (MPCAs) are systems designed to engage in dialogue with more than two participants simultaneously. Unlike traditional two-party agents, MPCAs pose additional design challenges due to the need to interpret both utterance semantics and social dynamics. This survey explores recent progress in MPCAs by addressing three key questions: 1) Can agents model each participant's mental states? (State of Mind Modeling); 2) Can they properly understand the dialogue content? (Semantic Understanding); and 3) Can they reason about and predict future conversation flow? (Agent Action Modeling). We review methods ranging from classical machine learning to Large Language Models (LLMs) and multi-modal systems. Our analysis underscores Theory of Mind (ToM) as essential for building intelligent MPCAs and highlights multi-modal understanding as a promising yet underexplored direction. Finally, this survey offers guidance to future researchers on developing more capable MPCAs.


Enhancing Dialogue Systems with Discourse-Level Understanding Using Deep Canonical Correlation Analysis

arXiv.org Artificial Intelligence

Dialogue systems, such as chatbots or virtual assistants, have made substantial progress in generating contextually appropriate responses. However, these systems face a persistent challenge in maintaining coherence and relevance across multiple turns in longer conversations. This is especially difficult when the context becomes complex, with numerous topics, nuanced references, or shifting conversational goals. With the objective of enhanced language modeling, such models often struggle to effectively utilize the entire discourse history, leading to responses that may be locally appropriate but globally inconsistent or irrelevant [8]. The core issue is how dialogue systems manage and interpret discourse history. Current models typically rely on the immediate context (e.g., the last few utterances) to generate responses, which can lead to a loss of important information from earlier in the conversation. This limitation becomes more pronounced in longer dialogues, where the context is spread across many turns and may involve intricate dependencies between utterances.


A Diverse and Effective Retrieval-Based Debt Collection System with Expert Knowledge

arXiv.org Artificial Intelligence

Designing effective debt collection systems is crucial for improving operational efficiency and reducing costs in the financial industry. However, the challenges of maintaining script diversity, contextual relevance, and coherence make this task particularly difficult. This paper presents a debt collection system based on real debtor-collector data from a major commercial bank. We construct a script library from real-world debt collection conversations, and propose a two-stage retrieval-based response system for contextual relevance. Experimental results show that our system improves script diversity, enhances response relevance, and achieves practical deployment efficiency through knowledge distillation. This work offers a scalable and automated solution, providing valuable insights for advancing debt collection practices in real-world applications.


EVOLVE: Emotion and Visual Output Learning via LLM Evaluation

arXiv.org Artificial Intelligence

While the ability to effectively communicate and retain user attention for longer periods of time is important in many HRI settings, eliciting an impression of empathy through nonverbal behavior can be critical to acceptance of and trust in social robots [1]. Through a comprehensive survey over several LLM-based actions, [2] discovered that social robots elicited higher expectations for more nuanced nonverbal cues including a breadth of behavior types. Conveying affects that are aligned with the user's emotional state can be critical in building trust around experienced empathy and personalization from a social robot [3]. Multi-modal feedback has profound impacts on successful empathetic interaction, as notions inferred from robot actions can be understood much more easily with systematic actions taken in alignment with an emotional response [2], [4].

Additionally, this kind of subdivided action schema can be used to evaluate many attributes towards promoting empathetic responses, including tone of voice, nonverbal cues, and facial expressions [6]. However, atomic actions with limited sentiments might not be sufficient to accommodate complex emotion in the user. This work investigates the possibility of a more open-ended response selection by leveraging an LLM's internal domain knowledge of emojis and other affective imagery capable of representing emotional states. We also employ recent advances in vision-language models with an image or camera input, as suggested in [2] and [4]. Additionally, we evaluate both motion and color [7] pattern elicitation through atomic action selection [5], [6]. We selected these decision categories based on a theoretical robot design that could contain an LED strip


Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations

arXiv.org Artificial Intelligence

Assessing the performance of systems to classify Multi-Party Conversations (MPC) is challenging due to the interconnection between linguistic and structural characteristics of conversations. Conventional evaluation methods often overlook variances in model behavior across different levels of structural complexity on interaction graphs. In this work, we propose a methodological pipeline to investigate model performance across specific structural attributes of conversations. As a proof of concept we focus on Response Selection and Addressee Recognition tasks, to diagnose model weaknesses. To this end, we extract representative diagnostic subdatasets with a fixed number of users and a good structural variety from a large and open corpus of online MPCs. We further frame our work in terms of data minimization, avoiding the use of original usernames to preserve privacy, and propose alternatives to using original text messages. Results show that response selection relies more on the textual content of conversations, while addressee recognition requires capturing their structural dimension. Using an LLM in a zero-shot setting, we further highlight how sensitivity to prompt variations is task-dependent.


Multi-turn Response Selection with Commonsense-enhanced Language Models

arXiv.org Artificial Intelligence

As a branch of advanced artificial intelligence, dialogue systems are prospering. Multi-turn response selection is a general research problem in dialogue systems. With the assistance of background information and pre-trained language models, the performance of state-of-the-art methods on this problem has improved impressively. However, existing studies neglect the importance of external commonsense knowledge. Hence, we design a Siamese network where a pre-trained Language model merges with a Graph neural network (SinLG). SinLG takes advantage of Pre-trained Language Models (PLMs) to catch the word correlations in the context and response candidates and utilizes a Graph Neural Network (GNN) to reason helpful common sense from an external knowledge graph. The GNN aims to assist the PLM during fine-tuning, arousing its related memories to attain better performance. Specifically, we first extract related concepts as nodes from an external knowledge graph to construct a subgraph with the context-response pair as a super node for each sample. Next, we learn two representations for the context-response pair via both the PLM and GNN. A similarity loss between the two representations is utilized to transfer the commonsense knowledge from the GNN to the PLM. Then only the PLM is used for online inference so that efficiency can be guaranteed. Finally, we conduct extensive experiments on two variants of the PERSONA-CHAT dataset, which show that our solution not only improves the performance of the PLM but also achieves efficient inference.
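
The similarity loss mentioned in the abstract above can be pictured with a minimal sketch. The abstract does not state the form of the loss, so the cosine-distance formulation here, the function names `similarity_loss` and `total_loss`, and the `weight` parameter are assumptions for illustration only; the vectors stand in for the PLM and GNN representations of one context-response pair.

```python
import math


def similarity_loss(plm_vec, gnn_vec):
    """Cosine distance (1 - cosine similarity) between two representations.

    An assumed form of the loss: it is 0 when the vectors point the same
    way and grows as they diverge, pulling the PLM output toward the GNN's.
    """
    dot = sum(p * g for p, g in zip(plm_vec, gnn_vec))
    norm_p = math.sqrt(sum(p * p for p in plm_vec))
    norm_g = math.sqrt(sum(g * g for g in gnn_vec))
    if norm_p == 0 or norm_g == 0:
        raise ValueError("representations must be non-zero vectors")
    return 1.0 - dot / (norm_p * norm_g)


def total_loss(matching_loss, plm_vec, gnn_vec, weight=0.5):
    """Combine a response-matching loss with the weighted similarity term.

    `weight` is a hypothetical trade-off hyperparameter.
    """
    return matching_loss + weight * similarity_loss(plm_vec, gnn_vec)


aligned = similarity_loss([1.0, 0.0], [1.0, 0.0])
orthogonal = similarity_loss([1.0, 0.0], [0.0, 1.0])
combined = total_loss(0.3, [1.0, 0.0], [0.0, 1.0], weight=0.5)
print(aligned, orthogonal, combined)
```

Because the GNN term appears only in the training objective, a setup like this is consistent with the abstract's point that the PLM alone can be used at inference time.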


DivTOD: Unleashing the Power of LLMs for Diversifying Task-Oriented Dialogue Representations

arXiv.org Artificial Intelligence

Language models pre-trained on general text have achieved impressive results in diverse fields. Yet, the distinct linguistic characteristics of task-oriented dialogues (TOD) compared to general text limit the practical utility of existing language models. Current task-oriented dialogue pre-training methods overlook the one-to-many property of conversations, where multiple responses can be appropriate given the same conversation context. In this paper, we propose a novel dialogue pre-training model called DivTOD, which collaborates with LLMs to learn diverse task-oriented dialogue representations. DivTOD guides LLMs in transferring diverse knowledge to smaller models while removing domain knowledge that contradicts task-oriented dialogues. Experiments show that our model outperforms strong TOD baselines on various downstream dialogue tasks and learns the intrinsic diversity of task-oriented dialogues.